Tagging spoken corpus

Authors

  • Yue-Shi Lee
  • Hsin-Hsi Chen
Abstract

Spoken language is more flexible in usage than written language, so tagging a spoken corpus differs from tagging a traditional written corpus. This paper proposes a new framework for tagging spoken corpora. The framework applies a written-language tagger to spoken data while taking the characteristics of spoken language into account. It also addresses the mismatch between the tag sets of the written and spoken corpora. The approach attempts to reduce the differences between these two kinds of language, and preliminary tests give very encouraging results.
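
The abstract does not spell out the framework's details, so the following is only a minimal sketch of the general idea it describes: reuse a tagger built for written text on spoken data, set aside typical spoken-language phenomena first, and then map the written tag set onto the spoken one. The filler list, the tag mapping, the `DISFL` label, and the functions `normalize`, `tag_spoken`, and `toy_written_tagger` are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical filler words; a real list depends on the language and corpus.
FILLERS = {"uh", "um", "er"}

# Hypothetical mapping from a written-corpus tag set to a spoken-corpus tag set.
WRITTEN_TO_SPOKEN: Dict[str, str] = {"NN": "N", "VB": "V", "JJ": "ADJ"}

# Hypothetical label for tokens the written tagger never sees.
DISFLUENCY_TAG = "DISFL"


def normalize(tokens: List[str]) -> List[Tuple[int, str]]:
    """Drop fillers and immediate repetitions, keeping original positions."""
    kept: List[Tuple[int, str]] = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low in FILLERS:
            continue
        # Simplification: any immediate repeat is treated as a disfluency.
        if kept and kept[-1][1].lower() == low:
            continue
        kept.append((i, tok))
    return kept


def tag_spoken(
    tokens: List[str],
    written_tagger: Callable[[List[str]], List[str]],
) -> List[Tuple[str, str]]:
    """Tag spoken tokens with a written-language tagger plus a tag mapping."""
    kept = normalize(tokens)
    written_tags = written_tagger([tok for _, tok in kept])
    tags = [DISFLUENCY_TAG] * len(tokens)
    for (i, _), tag in zip(kept, written_tags):
        # Tags without a mapping entry are passed through unchanged.
        tags[i] = WRITTEN_TO_SPOKEN.get(tag, tag)
    return list(zip(tokens, tags))


def toy_written_tagger(words: List[str]) -> List[str]:
    """Stand-in for a real written-language tagger (dictionary lookup only)."""
    lexicon = {"i": "PRP", "want": "VB", "coffee": "NN"}
    return [lexicon.get(w.lower(), "NN") for w in words]


if __name__ == "__main__":
    print(tag_spoken(["I", "uh", "want", "want", "coffee"], toy_written_tagger))
```

In this sketch the written tagger only sees the cleaned sequence "I want coffee", while the filler and the repeated word receive the placeholder label. Treating every immediate repetition as a disfluency is a deliberate simplification and would mislabel legitimate repeats.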

Similar resources

A semantic tagging tool for spoken dialogue corpus

In this paper, we report our semantic tagging tool for spoken dialogue corpora. The tool acquires analysis rules from a small-scale training corpus using Transformation-based Learning (TBL). It can learn dialogue act tagging rules and semantic frame tagging rules. In an open test, precision is 72% for dialogue act tagging and 58% for semantic frame tagging.

Layered Speech-Act Annotation for Spoken Dialogue Corpus

This paper describes the design of speech act tags for spoken dialogue corpora and its evaluation. Compared with the tags used for conventional corpus annotation, the proposed speech intention tag is specialized enough to determine system operations. However, detailed information description increases tag types. This causes an ambiguous tag selection. Therefore, we have designed an organization...

Tagging a Corpus of Spoken Swedish

In this article, we present and evaluate a method for training a statistical part-of-speech tagger on data from written language and then adapting it to the requirements of tagging a corpus of transcribed spoken language, in our case spoken Swedish. This is currently a significant problem for many research groups working with spoken language, since the availability of tagged training data from s...

Part of Speech Tagging and Lemmatisation for the Spoken Dutch Corpus

This paper describes the lemmatisation and tagging guidelines developed for the “Spoken Dutch Corpus”, and lays out the philosophy behind the high granularity tagset that was designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset we tested several existing taggers and tagger generators on initial samples of the corpus. The r...

DisMo: A Morphosyntactic, Disfluency and Multi-Word Unit Annotator. An Evaluation on a Corpus of French Spontaneous and Read Speech

We present DisMo, a multi-level annotator for spoken language corpora that integrates part-of-speech tagging with basic disfluency detection and annotation, and multi-word unit recognition. DisMo is a hybrid system that uses a combination of lexical resources, rules, and statistical models based on Conditional Random Fields (CRF). In this paper, we present the first public version of DisMo for ...

Publication date: 1999